## put your code here
library(knitr)
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(broom)
This project is based on real complany data to analyze the relationships of Virtual Machine(VM) remote session usage with date time and geolocation of VM and remote clients. The company name remains anonymous in the project. Two data files: Remote session available on GitHub here and Region available on GitHub here.
The company sells and leases its storage products to customers globally. The audit agent installed on customer’s network allows secure transfer of the error and diagnostic information to be sent back to the Customer Support Center(remote session). Engineers then analyze these error reports and trouble shoot for customers.
The company’s system has been tracking when, where(with IP address), how much and who data transfer comes from. However there is no data analytic tool to analyze these information. The project intends to use R to do statistical computing, depict the remote session allocations graphically, identity capacity shortages, over allocated VMs, hotspot with different geo locations and date time.
With recent three months of remote session data(over one million records) from 1/31/2016 to 4/30/2016, the goal is to find out whether there is a trend that the number of allocated VM sessions, session duration, data transfer has correlation with geolocation and date time.
Check whether normal distribution is followed and whether linear model can be applied. Here are some key benefits:
The backbone data comes from the company’s database, which tracks remote sessions’ start/end time, installed agent’s IP address, the VM name located in different data center, how much data has been transferred and the session termination cause. The data is queried via SQL and extracted into CSV fommat(session.csv). A Java program is developed to call IP lookup web service and extract the regional information into CSV format(region.csv)
First load the session file(session.csv.gz), which is extracted from production DB and zipped.
There are over 1 million remote sessios in the past three months. Each row is correspond to each remote session initiated globally.
Here is a decription of the columns:
Session_ID: Unique identifier of each remote session
Login: Account name to initiate a remote session to access VM in the data center
Start_Time: The timestamp when a remote session starts
End_Time: The timestamp when a remote session ends
Agent_IP: IP address of the agent installed on the product
Termination_Cause: a lookup value of differnt typese of terminate. A normal termination cause value is 3.
Data_Transfer: the file transfer from the agent(client) to the VM(Server) in byte
Gas_Server_Name: One of the fifteen VMs loccated in five data centers(Cork in Europe, Hopkinton, Durham, Santa Clara in the US and Singapore in Asia). Each data center has three VMs.
Duration: The time span between Start_Time and End_Time in minutes
WeekofYear: The week number since January 1, 2016. For example, January 8, 2016 will be 2.
Dateofyear: The date number since January 1, 2016.For example, January 8, 2016 will be 8.
Weekday: Day of the week. For example: SUN, MON.
## put your code here
session <- read_csv(gzfile("~/Downloads/session.csv.gz"))
login_info<-session %>% group_by(LOGIN) %>%
summarize(login_count=n())
count(session)
## Source: local data frame [1 x 1]
##
## n
## (int)
## 1 1181760
head(login_info[order(-login_info$login_count),], n=1 )
## Source: local data frame [1 x 2]
##
## LOGIN login_count
## (chr) (int)
## 1 syr 304708
(head(login_info[order(-login_info$login_count),], n=1 )$login_count)*100 / (count(session))
## n
## 1 25.78425
The highest number of remote sessions was initated by a generic account: SYR, which stands for 25% of the total login.
Do data manipulation to separate the start time to year, month, day, hour, minute and second.
## put your code here
p<-session %>% filter(START_TIME!="null")
p$SDATE <- as.Date(p$START_TIME,"%m/%d/%Y")
p<-p %>% separate(SDATE, into=c("SYEAR", "SMON", "SDATE"),sep="-")
p$STIME<-format(as.POSIXct(strptime(p$START_TIME, "%m/%d/%Y %H:%M:%S")), format="%H:%M:%S")
p<-p%>% separate(STIME, into=c("SHOUR", "SMIN", "SSECOND"),sep=":")
Each data center has three VMs named from 01 to 03. Add a DC column mapped from GAS_Server_Name column.
## put your code here
DC <-c(esrgckprd01="Cork",esrgckprd02="Cork",esrgckprd03="Cork",esrghoprd01="Hopkinton", esrghoprd02="Hopkinton", esrghoprd03="Hopkinton", esrgscprd01="Santa Clara", esrgscprd02="Santa Clara", esrgscprd03="Santa Clara", esrgweprd01="Durham", esrgweprd02="Durham",esrgweprd03="Durham", esrgspprd01="Singapore", esrgspprd02="Singapore",esrgspprd03="Singapore")
p$DC <-DC[p$GAS_SERVER_NAME]
Next the trend of between the number of remote sessions and the dayofyear from 1/31/2016 to 4/30/2016. Use point size to denote the average data transfer size in MB.
## put your code here
p %>% group_by(DAYOFYEAR) %>%
summarize(login_count=n(),DATA_TRANSFER=mean(DATA_TRANSFER/1000000)) %>%
ggplot(aes(DAYOFYEAR, login_count, group=1)) +
geom_point(aes( size=DATA_TRANSFER)) + geom_smooth(color="red")+
xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("The number of remote sessions") +labs(title="Remote sessions on all VMs")
As a whole, the trend is gradually up with day in the past three months. Interestingly, there is a trend that more and more remote sessions were initiated over the weekend from 1/31/2016 to 4/30/2016. It appears that more and more people work over the weekend.
Next check what the trend during 24 hour cycle is:
## put your code here
p %>% group_by(SHOUR) %>%
summarize(login_count=n(),DATA_TRANSFER=mean(DATA_TRANSFER/1000000)) %>%
ggplot(aes(SHOUR, login_count, group=1)) +
geom_point(aes( size=DATA_TRANSFER)) + geom_smooth(color="red")+
xlab("24 hour in EST")+ylab("The number of remote sessions") +labs(title="Remote sessions around the hour")
It is observed that during 24 hour cycle, more emote sessions were initiated in the morning of EST. More data transfers happened between 8 PM towards midnight EST.
As remote sessions accessed one of the five data centers, which data center took the most loading in number of remote sessions?
## put your code here
p %>% group_by(DC) %>%
summarize(login_count=n()) %>%
ggplot(aes(DC, login_count, fill=DC)) +geom_bar(stat="identity")+
geom_hline(aes( yintercept = mean(login_count)))+
xlab("Data Centers")+ylab("The number of remote sessions") +labs(title="Remote sessions across Data Center")
It turns out the data center in Cork took most remote sessions. Then what about the average duration considering normal termination(termination cause ==3) and average data transferred across data centers?
## put your code here
p %>% filter(!is.na(DURATION) & (TERMINATION_CAUSE==3)) %>% group_by(DC) %>%
summarize(DURATION=mean(as.numeric(DURATION))) %>%
ggplot(aes(DC, DURATION, fill=DC)) +geom_bar(stat="identity")+
geom_hline(aes( yintercept = mean(DURATION)))+
xlab("Data Centers")+ylab("Average remote session durtaion in minutes") +labs(title="Average remote session durtaion across Data Center")
p %>% filter(!is.na(DATA_TRANSFER)) %>% group_by(DC) %>%
summarize(DATA_TRANSFER=mean(DATA_TRANSFER/1000000)) %>%
ggplot(aes(DC, DATA_TRANSFER, fill=DC)) +geom_bar(stat="identity")+
geom_hline(aes( yintercept = mean(DATA_TRANSFER)))+
xlab("Data Centers")+ylab("Average data transfer in MBs") +labs(title="Average data transfer across Data Center")
It appears the durations were evenly distributed acorss data centers while Singapore data center received the least data transfer.
The next step is to concentrate on the number of remote session across VMs in all data centers.
## put your code here
p %>% group_by(GAS_SERVER_NAME) %>%
summarize(login_count=n()) %>%
ggplot(aes(GAS_SERVER_NAME, login_count, fill=GAS_SERVER_NAME)) +geom_bar(stat="identity")+
geom_hline(aes( yintercept = mean(login_count)))+
xlab("VMs across Data Centers")+ylab("The number of remote sessions") +labs(title="Remote sessions across VMs")
p %>% group_by(DAYOFYEAR,DC) %>%
summarize(login_count=n()) %>%
ggplot(aes(x = DAYOFYEAR, y = login_count, color = DC)) + geom_line()+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("The number of remote sessions") +labs(title="The trend of remote sessions for data centers")
The result is consistent with the previous observation: Cork VMs were taking the most loading of remote sessions.
Now consider the geolocation of the client, which is determined by installed agent’s IP address.
First load region.csv.gz, which is generated by a Java program calling IP lookup web service for each IP address.
Then left join session to region so that each row contains both client and server information.
Look into the data based on the client country.
## put your code here
region <- read_csv(gzfile("~/Downloads/region.csv.gz"))
p<-left_join(p, region, by=c("AGENT_IP"="IP"))
login_info<-p %>% group_by(Country) %>%
summarize(login_count=n())
head(login_info[order(-login_info$login_count),], n=10 )
## Source: local data frame [10 x 2]
##
## Country login_count
## (chr) (int)
## 1 US 585486
## 2 DE 63767
## 3 GB 44149
## 4 FR 39580
## 5 BR 34879
## 6 JP 32798
## 7 IT 30022
## 8 ES 29664
## 9 AU 25195
## 10 SE 19644
login_info<-p %>% filter(Country=="US")%>% group_by(Region) %>%
summarize(login_count=n())
head(login_info[order(-login_info$login_count),], n=5 )
## Source: local data frame [5 x 2]
##
## Region login_count
## (chr) (int)
## 1 NY 71204
## 2 Unknown 70584
## 3 CA 56619
## 4 TX 42856
## 5 NJ 40497
The number of remote sessions requesting from the US is way higher than that from other countries. Within the US, NY state ranks first. Unknown region, ranking in the second place, means the IP address is internal or can not be determined.
## put your code here
p %>% group_by(Country) %>%
summarize(login_count=n()) %>% filter(login_count>10000) %>%
mutate(Country=reorder(Country,login_count)) %>%
ggplot(aes(Country, login_count, fill=Country)) +geom_bar(stat="identity")+xlab("Countries with more than 10,000 remote sessions from 1/31/216 to 4/30/2016")+ylab("The number of remote sessions") +labs(title="Remote session distribution")
Based on the graph, the number of remote sessions coming from US was way ahead of other countries.
Based on the previous obvervation, the remote sessions to Cork data center were loaded most. It appears a number of US remote sessions routed to Cork Data Center even though there are three US data centers, which is inefficient.
## put your code here
p %>% filter(DC=="Cork") %>%group_by(Country) %>%
summarize(login_count=n()) %>% filter(login_count>10000) %>%
mutate(Country=reorder(Country,login_count)) %>%
ggplot(aes(Country, login_count, fill=Country)) +geom_bar(stat="identity")+xlab("Countries with more than 10,000 remote sessions from 1/31/216 to 4/30/2016")+ylab("The number of remote sessions") +labs(title="Remote session distribution in Cork data center")
The graph above confirms: the number of remote sessions requesting from the US ranked the second among european countries for Cork data center.
Take another angle, of all the remote ressions coming from US, what about their distribution of the data centers?
## put your code here
p %>% filter(Country=="US") %>%group_by(DC) %>%
summarize(login_count=n()) %>%
ggplot(aes(DC, login_count, fill=DC)) +geom_bar(stat="identity")+
geom_hline(aes( yintercept = mean(login_count)))+xlab("Data Centers")+ylab("The number of remote sessions") +labs(title="Data center distribution of the remote sessions coming from the US")
Obviously the US data center in Santa Clara is not fully used while a number of US remote sessions are sent to Cork data center. Connection delay is expected. Based on the business knowlege, the agent uses its configured timezone to decide which data center to access. It appears some configurations are wrong.
Split the dta into groups by data center considering the number of remote sessions, duration and data transfer.
## put your code here
theme_set(theme_bw())
DataCenter<-c("Cork", "Hopkinton", "Durham","Santa Clara","Singapore")
p %>%
group_by(DAYOFYEAR, DC) %>%
summarize(login_count=n(),DATA_TRANSFER=mean(DATA_TRANSFER/1000000)) %>%
filter(DC %in% DataCenter) %>%
ggplot(aes(x = DAYOFYEAR, y = login_count, size=DATA_TRANSFER)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~DC, scales = "free")+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("Average number of remote sessions") +labs(title="Remote session trend across Data Centers")
#Duration across Data center
p %>% filter(TERMINATION_CAUSE==3) %>%
group_by(DAYOFYEAR, DC) %>%
summarize(DURATION=mean(DURATION)) %>%
filter(DC %in% DataCenter) %>%
ggplot(aes(x = DAYOFYEAR, y = DURATION)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~DC, scales = "free")+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("Duration of remote session in minutes") +labs(title="Duration of remote session on each Data Center")
#Data Transfer across Data Center
p %>%
group_by(DAYOFYEAR, DC) %>%
summarize(DATA_TRANSFER=mean(DATA_TRANSFER)/1000000) %>%
filter(DC %in% DataCenter) %>%
ggplot(aes(x = DAYOFYEAR, y = DATA_TRANSFER)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~DC, scales = "free")+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("Data Transfer in MBs ") +labs(title="Data transfer of remote session on each Data Center")
#Check the remote login coming from US across Data ceter
p %>% filter(Country=="US") %>%
group_by(DAYOFYEAR, DC) %>%
summarize(login_count=n()) %>%
filter(DC %in% DataCenter) %>%
ggplot(aes(x = DAYOFYEAR, y = login_count)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~DC, scales = "free")+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("Number of remote sessions") +labs(title="Remote login coming from US across Data center")
#Each data center has three VMs
VM <-c("esrgckprd01","esrgckprd02","esrgckprd03","esrghoprd01","esrghoprd02","esrghoprd03","esrgweprd01","esrgweprd02","esrgweprd03","esrgscprd01","esrgscprd02","esrgscprd03","esrgspprd01","esrgspprd02","esrgspprd03")
# check remote sessions across each VM
p %>% group_by(DAYOFYEAR, GAS_SERVER_NAME) %>%
summarize(login_count=n()) %>%
filter(GAS_SERVER_NAME %in% VM) %>%
ggplot(aes(x = DAYOFYEAR, y = login_count)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~GAS_SERVER_NAME, scales = "free")+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("Average number of remote sessions") +labs(title="Cork:ckxxx,Hop:hoxxx,Durham:wexxx,Santa Clara:scxxx,Singapore:spxxx")
The number of remote sessions increased consistently in the US data centers(Durham, Hopkinton and Santa Clara) and Singapore with data/time. The data center in Santa Clara and Sigapore were not fully used.
Duration of remote sessions in each data center reduces with data/time.
The remote sessions coming from the US were routed to Santa Clara in a faster pace than any other data center.
Check whether the number of remote session is in normal distribution.
## put your code here
x<-p %>%
group_by(DAYOFYEAR) %>%
summarize(login_count=n())
fit<-x %>% lm(login_count ~ DAYOFYEAR, data = .)
hist(x$login_count)
x %>%
ggplot(aes(DAYOFYEAR, login_count)) +
geom_point() +
geom_abline(intercept = fit$coef[1],
slope = fit$coef[2])+xlab("Day of Year from 1/31/2016 to 4/30/2016 except weekends")+ylab("Average number of remote sessions")
sd(x$login_count)
## [1] 4235.624
qqnorm(x$login_count)
qqline(x$login_count)
The Normal Q-Q plot shows the bulk number of remote sessions with day of year does not fit a good normal distribultion(big tails!).
From the previous observation, the remote session count over the weekend is way below that of weekday. Also the remote session count from the US is way higher than other contries. Check whether remote sessions are in normal distribution if these two variables are filtered.
x<-p %>%
filter(Country=="US"&(WEEKDAY!="SAT" & WEEKDAY !='SUN')) %>%
group_by(DAYOFYEAR) %>%
summarize(login_count=n())
fit<-x %>% lm(login_count ~ DAYOFYEAR, data = .)
hist(x$login_count)
x %>%
ggplot(aes(DAYOFYEAR, login_count)) +
geom_point() +
geom_abline(intercept = fit$coef[1],
slope = fit$coef[2])+xlab("Day of Year from 1/31/2016 to 4/30/2016 except weekends")+ylab("Average number of remote sessions from US")
sd(x$login_count)
## [1] 1078.45
qqnorm(x$login_count)
qqline(x$login_count)
x<-p %>%
filter(Country=="US"&(WEEKDAY!="SAT" & WEEKDAY !='SUN')&(DC=="Hopkinton"|DC=="Durham"|DC=="Santa Clara")) %>%
group_by(DAYOFYEAR) %>%
summarize(login_count=n())
fit<-x %>% lm(login_count ~ DAYOFYEAR, data = .)
hist(x$login_count)
x %>%
ggplot(aes(DAYOFYEAR, login_count)) +
geom_point() +
geom_abline(intercept = fit$coef[1],
slope = fit$coef[2])+xlab("Day of Year from 1/31/2016 to 4/30/2016 except weekends")+ylab("Average number of remote sessions from US")
sd(x$login_count)
## [1] 1027.469
qqnorm(x$login_count)
qqline(x$login_count)
#mean(x$login_count) + c(-1,1)*qnorm(0.975)*sd(x$login_count)
x<-p %>%
filter(Country!="US"&(WEEKDAY!="SAT" & WEEKDAY !='SUN')) %>%
group_by(DAYOFYEAR) %>%
summarize(login_count=n())
fit<-x %>% lm(login_count ~ DAYOFYEAR, data = .)
hist(x$login_count)
x %>%
ggplot(aes(DAYOFYEAR, login_count)) +
geom_point() +
geom_abline(intercept = fit$coef[1],
slope = fit$coef[2])+xlab("Day of Year from 1/31/2016 to 4/30/2016 except weekends")+ylab("Average number of remote sessions not from US")
sd(x$login_count)
## [1] 1346.771
qqnorm(x$login_count)
qqline(x$login_count)
The regression line shows the number of remote session from the US increases but the number of from non-US countries decreases.
Once filter the remote session with proper geolation, normal Q-Q plots confirms that the linear models fits the data well so that the trend is correctly predicted.
The whole data set is not in perfect normal distribution. However once the remote session is filtered by the source(Country) and from weekend data(because the remote session count is much lower than weekdays), the filtered data is in normal distribution and confirmed by Normal Q-Q plot.
Cork’s data center is overloaded compared to other data centers. Santa Clara and Singapore data centers are under load in terms of remote sessions.
A number of agents, because their timezone settings are not configured correctly, are routing to the data centers far away from their geolication. For example, a number of remote sesions coming from the US is routed to Cork or Singapore, even though the data center in Santa Clara has enough capacity.
The general trend of the number of remote sessions rises gradually with date time. The number of remote sessions rises dramatically over the weened although it is much lower than weekdays. The duration of remote session to each data center reduces consistently.
The number of remote sessions coming from the US is way ahead of that of other countries. While the number of remote session from the US increases, the number of from non-US countries decreases.